Introducing the Yahoo News Ranked Multi-label Corpus, a Novel Dataset to Improve Multilabel Learning
Most content-based websites, such as Yahoo News or HuffPost, organize their stories by subject matter or in some similar way. A site with a very large number of stories needs an automated way to filter or categorize them as the content is ingested into its systems. For example, the algorithms that power Yahoo News label each news article with tags (e.g., Military conflict, Nuclear policy, Refugees) at ingestion time, and the site then displays the content by subject matter and/or on a personalized feed. This process of labeling content with all of its relevant tags is known as Multilabel Learning (MLL). Until now, scientists and engineers building their own MLL models to label content have used datasets that provide pre-computed features, such as bag-of-words vectors or dense representations like doc2vec.
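To make the MLL setup concrete, here is a minimal sketch of how articles and their tags might be turned into the kind of pre-computed inputs described above: a bag-of-words count vector per article and a multi-hot (0/1) vector per tag set. The three toy articles, the tag assignments, and the helper names `bag_of_words` and `multi_hot` are invented for illustration; they are not taken from the Yahoo News corpus or its tooling.

```python
from collections import Counter

# Toy articles and tag sets, invented purely for illustration.
articles = [
    ("troops cross the border as ceasefire talks stall",
     {"Military conflict", "Refugees"}),
    ("ministers debate new nuclear policy treaty",
     {"Nuclear policy"}),
    ("refugees flee border towns amid fighting",
     {"Refugees", "Military conflict"}),
]


def bag_of_words(texts):
    """Return a sorted vocabulary and one word-count vector per text."""
    vocab = sorted({word for text in texts for word in text.split()})
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def multi_hot(label_sets):
    """Return sorted label names and one 0/1 indicator vector per label set."""
    labels = sorted(set().union(*label_sets))
    return labels, [[int(label in s) for label in labels] for s in label_sets]


texts = [text for text, _ in articles]
tag_sets = [tags for _, tags in articles]

vocab, X = bag_of_words(texts)   # features: one count vector per article
labels, Y = multi_hot(tag_sets)  # targets: one multi-hot vector per article

for row, tags in zip(Y, tag_sets):
    print(row, sorted(tags))
```

An MLL model would then be trained to map each row of `X` to the corresponding row of `Y`. Unlike single-label classification, a row of `Y` may contain several ones, because an article can carry more than one relevant tag.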